This script describes the analysis of the ensemble of models generated by the ReplicaExchange macro. See the cluster analysis tutorial for the first part of the script and for instructions on how to run it.


In [ ]:
import IMP
import IMP.pmi
import IMP.pmi.macros

In [ ]:
is_mpi=True  # run in parallel; requires mpi4py

model=IMP.Model()

First, construct the analysis class by indicating where the files are and whether you want to merge different runs together:

merge_directories is a list of the directories containing the runs you want to merge into a single analysis.

rmf_dir is the name of the directory containing the RMF files.

global_output_directory is the name of the directory, within each of the merge_directories, that contains the RMF files and the stat files.

stat_file_name_suffix is the suffix of the stat file names in /merge_directories/global_output_directory/.
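
To make the directory layout concrete, here is a minimal sketch of how the three path-related arguments might combine into a search pattern for the stat files of one run. The helper `stat_file_pattern` is hypothetical and not part of IMP, and the `<suffix>.*.out` file naming is an assumption; check the macro's source for the exact rule it applies.

```python
import posixpath


def stat_file_pattern(merge_directory, global_output_directory,
                      stat_file_name_suffix):
    """Build a glob pattern for the stat files of one run (hypothetical helper)."""
    # Strip the slashes so that "/output/" is treated as a subdirectory
    # of the run directory rather than as an absolute path.
    output_dir = global_output_directory.strip("/")
    # Assumption: stat files are named "<suffix>.<replica index>.out".
    return posixpath.join(merge_directory, output_dir,
                          stat_file_name_suffix + ".*.out")


patterns = [stat_file_pattern(d, "/output/", "stat")
            for d in ["replica_exchange_directory_1",
                      "replica_exchange_directory_2"]]
```

Each entry of `patterns` could then be passed to `glob.glob` to list the stat files of the corresponding run.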


In [ ]:
mc=IMP.pmi.macros.AnalysisReplicaExchange0(model,
                  stat_file_name_suffix="stat",
                  merge_directories=["replica_exchange_directory_1","replica_exchange_directory_2"],
                  global_output_directory="/output/",
                  rmf_dir="rmfs/")

Second, list the features you want to extract from the stat files:


In [ ]:
feature_list=["ISDCrossLinkMS_Distance_intrarb",
              "ISDCrossLinkMS_Distance_interrb",
              "ISDCrossLinkMS_Data_Score",
              "GaussianEMRestraint_None",
              "SimplifiedModel_Linker_Score_None",
              "ISDCrossLinkMS_Psi_1.0_",
              "ISDCrossLinkMS_Sigma_1_"]
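
Some entries in the list end with an underscore, which suggests that they are matched against the beginning or a part of the full stat file keys. The sketch below illustrates that idea with plain substring matching. It is an assumption about the matching rule, not the macro's actual implementation; the helper `match_feature_keys` and the example stat keys are hypothetical.

```python
def match_feature_keys(requested, available):
    """Return the available keys that contain any requested keyword."""
    matched = []
    for key in available:
        # Assumption: a stat key is selected when a requested keyword
        # occurs anywhere inside it.
        if any(keyword in key for keyword in requested):
            matched.append(key)
    return matched


# Hypothetical keys, as they might appear in a stat file.
available_keys = ["ISDCrossLinkMS_Psi_1.0_XL",
                  "ISDCrossLinkMS_Sigma_1_XL",
                  "GaussianEMRestraint_None",
                  "Stopwatch_None_delta_seconds"]
requested_keys = ["ISDCrossLinkMS_Psi_1.0_", "GaussianEMRestraint_None"]

selected_keys = match_feature_keys(requested_keys, available_keys)
```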

Last, set up the clustering.

skip_clustering=True skips the clustering entirely and only extracts the best scoring models.

prefiltervalue=300 is the upper bound on the score; models scoring above it are not considered. It is needed for memory efficiency.

number_of_best_scoring_models is the number of best scoring models to extract.

first_and_last_frames=[0,0.5] selects the portion of the trajectory to analyze. For example, [0,0.5] extracts the best scoring models from the first half of the ensemble, and [0.5,1.0] from the second half.
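
The following sketch shows how these three settings could interact on a plain list of scores. It is an illustration under stated assumptions, not the macro's code: the helper `select_best_frames` is hypothetical, lower scores are assumed to be better, and a score equal to prefiltervalue is assumed to be kept.

```python
def select_best_frames(scores, first_and_last_frames, prefiltervalue,
                       number_of_best_scoring_models):
    """Return the indices of the best scoring frames (hypothetical helper)."""
    n_frames = len(scores)
    # Convert the fractional window into frame indices.
    start = int(first_and_last_frames[0] * n_frames)
    end = int(first_and_last_frames[1] * n_frames)
    # Keep only the frames in the window that pass the score prefilter.
    kept = [(score, index)
            for index, score in enumerate(scores)
            if start <= index < end and score <= prefiltervalue]
    # Sort by score (lowest first) and keep the requested number of frames.
    kept.sort()
    return [index for score, index in kept[:number_of_best_scoring_models]]


scores = [350.0, 120.0, 90.0, 310.0, 80.0, 60.0, 400.0, 50.0]
first_half = select_best_frames(scores, [0, 0.5], 300, 2)
second_half = select_best_frames(scores, [0.5, 1.0], 300, 2)
```

With these example scores, `first_half` is `[2, 1]` and `second_half` is `[7, 5]`: frames scoring above 300 are dropped before the ranking, so they never need to be held in memory.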


In [ ]:
mc.clustering("SimplifiedModel_Total_Score_None",
              "rmf_file",
              "rmf_frame_index",
              prefiltervalue=300,
              number_of_best_scoring_models=100,
              skip_clustering=True,
              feature_keys=feature_list,
              first_and_last_frames=[0,0.5],
              is_mpi=is_mpi,
              get_every=1)